EpiClass accurately predicts EpiATLAS assay and biospecimen metadata

Author

Joanny Raby

Results section 2 figures

Formatting of the figures may not be identical to the paper, but they contain the same data points.

All code is folded by default. For any given cell, click on “Code” to expand it, or unfold all using code options beside the main title, above.

Some code may be repeated, as the original Python notebook was written so figures could be created semi-independently (so one can fully re-run individual sections when making modifications).

Important

THIS IS A WORK IN PROGRESS.

Some figures are still missing. Message date: 2025-10-01

Article text

Building on the successful classification of assay and biospecimen attributes, we extended our approach to six additional categories of harmonized metadata from EpiATLAS (donor sex, sample cancer status, donor life stage, biomaterial type, paired-end sequencing status and data provider (consortium)). These new classifiers demonstrated similarly strong performance, achieving accuracy above 95%, F1-scores exceeding 92% (except for life stage), and AUROC values greater than 98% (Fig. 2A-B, Supplementary Fig. 5A, Supplementary Table 2). The F1-score variations among categories primarily reflected underlying class imbalances, quantified through normalized Shannon entropy (Fig. 2C, Supplementary Table 1). For instance, in the life stage category, the fetal and newborn classes contained only 44 and 106 datasets respectively, compared to over 5,000 adult datasets. Similarly, in the sex category, the mixed class included just 114 datasets versus more than 3,400 for each of the female and male classes. Despite these class imbalance challenges, the robust performance of the classifiers enabled us to augment, with high confidence, over 85% of the previously missing donor sex and life stage labels in EpiATLAS v2.0 metadata, supported by the presence of multiple assays per biological sample. Moreover, the performance of all classifiers improves as the prediction score threshold increases, prompting us to treat prediction scores as confidence scores hereafter (Supplementary Fig. 6A).
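The score-thresholding behaviour described above can be illustrated with a small simulation (toy data, not EpiClass outputs; it only assumes prediction scores are roughly calibrated): raising the threshold trades coverage for accuracy.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Simulated "Max pred" scores in [0.5, 1] and per-file correctness that
# tracks the score, i.e. a roughly calibrated classifier.
max_pred = rng.uniform(0.5, 1.0, n)
correct = rng.random(n) < max_pred

for thr in (0.0, 0.6, 0.8, 0.9):
    keep = max_pred >= thr
    print(
        f"threshold {thr:.1f}: "
        f"accuracy {correct[keep].mean():.3f} on {keep.mean():.0%} of files"
    )
```

Accuracy on the retained files rises monotonically with the threshold, at the cost of labeling fewer files, which is the trade-off behind treating prediction scores as confidence scores.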

To support the sex predictions, we leveraged chromosome Y (chrY) signal as a complementary validation metric given that it was excluded during training. As expected, this signal showed distinct patterns between male and female samples [28], where males exhibited higher average methylation signals on chrY than females, with the exception of WGBS (Supplementary Fig. 5B). Moreover, progressively increasing the prediction confidence threshold resulted in a corresponding increase in the average z-score separation between the clusters, indicating that the prediction score aligns well with the strength of this independent biological sex marker (Supplementary Fig. 5C). Through this approach, we confidently assigned sex metadata to 268 unlabeled biological samples, reducing unknown cases to 2% (Fig. 2D, Supplementary Table 5). This analysis also revealed 23 previously mislabeled samples, which were corrected in EpiATLAS v2.0 (Fig. 2E, Supplementary Table 5).

We achieved similarly strong results for life stage classification, confidently assigning metadata labels to 404 biological samples, with only 5% remaining unknown (Supplementary Table 5). To support these predictions, the WGBS data were provided to the machine learning GP-age tool [18], showing the expected trends between its predicted age and the life stage categories across both previously unknown (n = 145) and all samples (n = 565) (Fig. 2F, Supplementary Fig. 5D). Notably, blood-related samples showed tighter correspondence between GP-age predictions and life stage categories, with more distinct age distributions across the perinatal, pediatric, and adult groups. This improved alignment is consistent with GP-age's training set, which is composed of blood samples. This analysis also identified 17 mislabeled biological samples, which were likewise corrected in EpiATLAS v2.0 (Supplementary Table 5).

Using Shapley additive explanations (SHAP) [29], we identified the genomic regions that most significantly influence model predictions for Biospecimen classification. SHAP values quantify each feature's contribution to model decisions by measuring its impact on predictions relative to the expected value. This approach revealed an average of 187 important genomic bins driving the Biospecimen classifier's decisions (Supplementary Table 6). Interestingly, their potential biological relevance is supported by two aspects: 1) their elevated functional potential compared to all regions of the genome, quantified using the ChromActivity computational framework [26,30] (Fig. 2G, Supplementary Fig. 7), and 2) their biospecimen-specific enriched gene ontology terms, which align well with expected ones and are also supported by the Epilogos visualization [31] of a representative region (Fig. 2H-I, Supplementary Table 7).
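As a conceptual illustration of the SHAP principle described above (the actual analysis uses the SHAP library on the trained networks; this toy sketch just computes exact Shapley values by coalition enumeration, which is only tractable for a handful of features):

```python
import itertools
import math

import numpy as np


def exact_shapley(model, x, baseline):
    """Exact Shapley values for one prediction, enumerating all feature
    coalitions; feature i gets the weighted average of its marginal effect."""
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(len(others) + 1):
            for S in itertools.combinations(others, r):
                weight = (
                    math.factorial(len(S))
                    * math.factorial(n - len(S) - 1)
                    / math.factorial(n)
                )
                x_S = baseline.copy()
                x_S[list(S)] = x[list(S)]
                x_Si = x_S.copy()
                x_Si[i] = x[i]
                phi[i] += weight * (model(x_Si) - model(x_S))
    return phi


# For a linear model, phi_j = w_j * (x_j - baseline_j)
w = np.array([2.0, -1.0, 0.5])
model = lambda v: float(w @ v)
x = np.array([1.0, 3.0, 2.0])
baseline = np.zeros(3)
print(exact_shapley(model, x, baseline))  # ≈ [2, -3, 1]
```

By construction the attributions sum to the difference between the prediction and the baseline prediction, the "efficiency" property that makes per-bin SHAP values directly comparable.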

We also identified 503 SHAP-based important regions for the Sex classifier and 336 for the Cancer status classifier (Supplementary Table 6). Unsurprisingly, ~30% of the important regions for sex classification were on chrX (5.8-fold enrichment), including the known sex-influenced genes XIST [32] and FIRRE [33] (P = 3.30E-3, Methods) (Supplementary Fig. 5E), with X-linked inheritance (HP:0001417) emerging as the most significant term from the 643 protein-coding genes in these regions (P = 3.40E-55) (Supplementary Table 7). Notably, 78.6% (22/28) of the features overlapping the pseudoautosomal region 1 (PAR1) were among the important regions, a highly significant enrichment (P = 2.55E-18). The important regions from the Cancer status classifier showed a much higher enrichment (z-score > 3) in the copy number (CN) alteration signatures associated with focal events such as the tandem duplicator phenotype (CN17), complex patterns (CN18-21) and focal loss of heterozygosity (CN9-12), compared to chromosomal events (Fig. 2I, Supplementary Table 8) [34]. Enriched gene ontology terms from these same important regions relate to chromatin structure (GO:0030527, P = 3.38E-23) and histone deacetylation (REAC:R-HSA-3214815, P = 2.60E-19) (Supplementary Table 7).
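An enrichment like the PAR1 one above is a standard one-sided hypergeometric calculation. A sketch with assumed round numbers (the true universe size is the number of 100 kb input features, which is not restated here, so the resulting P-value is illustrative only):

```python
from scipy.stats import hypergeom

# Assumed round figures for illustration
total_bins = 30_000  # hypothetical genome-wide 100 kb bin count
important = 503      # SHAP-important regions of the Sex classifier
par1_features = 28   # features overlapping PAR1
observed = 22        # PAR1 features found among the important regions

# One-sided P(X >= 22) under a hypergeometric null:
# drawing `important` bins from `total_bins`, of which `par1_features` are PAR1
p = hypergeom.sf(observed - 1, total_bins, par1_features, important)
print(f"P = {p:.2e}")
```

The expected overlap under the null is under one feature (503 × 28 / 30,000 ≈ 0.47), so observing 22 yields a vanishingly small P-value, in line with the strong enrichment reported.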

Collectively, these results demonstrate that EpiClass can be used to confidently validate and enrich the metadata of associated data, and suggest that the regions identified as important by the different classifiers contain biologically relevant information.

Setup Code - Imports and co.

Setup imports.

Code
from __future__ import annotations

import itertools
import re
import tarfile
from pathlib import Path
from typing import Dict, List, Sequence

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from IPython.display import display
from plotly.subplots import make_subplots
from scipy.stats import zscore

from epiclass.utils.notebooks.paper.paper_utilities import (
    ASSAY,
    ASSAY_MERGE_DICT,
    ASSAY_ORDER,
    BIOMATERIAL_TYPE,
    CELL_TYPE,
    LIFE_STAGE,
    SEX,
    IHECColorMap,
    MetadataHandler,
    SplitResultsHandler,
    create_mislabel_corrector,
    PathChecker
)

CORE7_ASSAYS = ASSAY_ORDER[0:7]

Setup paths.

Code
# Root path
base_dir = Path.home() / "Projects/epiclass/output/paper"
PathChecker.check_directory(base_dir)

# More precise
base_data_dir = base_dir / "data"
base_fig_dir = base_dir / "figures"
table_dir = base_dir / "tables"
metadata_dir = base_data_dir / "metadata"

official_metadata_dir = metadata_dir / "epiatlas" / "official"
PathChecker.check_directory(official_metadata_dir)

# alias
paper_dir = base_dir

Setup colors.

Code
IHECColorMap = IHECColorMap(base_fig_dir)
assay_colors = IHECColorMap.assay_color_map
cell_type_colors = IHECColorMap.cell_type_color_map
sex_colors = IHECColorMap.sex_color_map

Setup metadata and prediction files handlers.

Code
split_results_handler = SplitResultsHandler()

metadata_handler = MetadataHandler(paper_dir)
metadata_v2 = metadata_handler.load_metadata("v2")
metadata_v2_df = metadata_v2.to_df()

Setup figures general settings.

Code
main_title_settings = {
    "title":dict(
        automargin=True,
        x=0.5,
        xanchor="center",
        yanchor="top",
        y=0.98
        ),
    "margin":dict(t=50, l=10, r=10)
}

Figure 2

A,B - MLP performance on various classification tasks (100kb resolution)

  • 2A: Accuracy
  • 2B: F1-Score
  • Supp 5A: AUC scores

Define function that graphs performance across classification tasks.

Code
def NN_performance_across_classification_tasks(
    split_metrics: Dict[str, Dict[str, Dict[str, float]]],
    name: str | None = None,
    logdir: Path | None = None,
    exclude_categories: List[str] | None = None,
    y_range: List[float] | None = None,
    sort_by_acc: bool = False,
    metric_names: Sequence[str] = ("Accuracy", "F1_macro"),
    title: str | None = None,
) -> List[str]:
    """Render box plots of metrics per classifier and split, each in its own subplot.

    This function generates a figure with subplots, each representing a different
    metric. Each subplot contains box plots for each classifier, ordered by accuracy.

    Args:
        split_metrics: A nested dictionary with structure {split: {classifier: {metric: score}}}.
        logdir: The directory path to save the output plots. If None, only display the plot.
        name: The base name for the output plot files.
        exclude_categories: Task categories to exclude from the plot.
        y_range: The y-axis range for the plots.
        sort_by_acc: Whether to sort the classifiers by accuracy.
        metric_names: The metrics to include in the plot.
        title: Custom plot title; a default is used if None.

    Returns:
        The list of classifier names in the order they appear in the plot.
    """
    # Exclude some categories
    classifier_names = list(split_metrics["split0"].keys())
    if exclude_categories is not None:
        for category in exclude_categories:
            classifier_names = [c for c in classifier_names if category not in c]

    available_metrics = list(split_metrics["split0"][classifier_names[0]].keys())
    try:
        available_metrics.remove("count")
    except ValueError:
        pass
    if any(metric not in available_metrics for metric in metric_names):
        raise ValueError(f"Invalid metric. Metrics need to be in {available_metrics}")

    # Get classifier counts
    classifiers_N = split_results_handler.extract_count_from_metrics(split_metrics)

    # Sort classifiers by accuracy
    if sort_by_acc:
        mean_acc = {}
        for classifier in classifier_names:
            mean_acc[classifier] = np.mean(
                [split_metrics[split][classifier]["Accuracy"] for split in split_metrics]
            )
        classifier_names = sorted(
            classifier_names, key=lambda x: mean_acc[x], reverse=True
        )

    # Create subplots, one column for each metric
    fig = make_subplots(
        rows=1,
        cols=len(metric_names),
        subplot_titles=metric_names,
        horizontal_spacing=0.03,
    )

    color_group = px.colors.qualitative.Plotly
    colors = {
        classifier: color_group[i % len(color_group)]
        for i, classifier in enumerate(classifier_names)
    }

    point_pos = 0
    for i, metric in enumerate(metric_names):
        for classifier_name in classifier_names:
            values = [
                split_metrics[split][classifier_name][metric] for split in split_metrics
            ]

            fig.add_trace(
                go.Box(
                    y=values,
                    name=f"{classifier_name} (N={classifiers_N[classifier_name]})",
                    fillcolor=colors[classifier_name],
                    line=dict(color="black", width=1.5),
                    marker=dict(size=3, color="white", line_width=1),
                    boxmean=True,
                    boxpoints="all",
                    pointpos=point_pos,
                    showlegend=i == 0,  # Only show legend in the first subplot
                    hovertemplate="%{text}",
                    text=[
                        f"{split}: {value:.4f}"
                        for split, value in zip(split_metrics, values)
                    ],
                    legendgroup=classifier_name,
                    width=0.5,
                ),
                row=1,
                col=i + 1,
            )

    # Title
    title_text = (
        "Neural network classification - Metric distribution for 10-fold cross-validation"
    )
    if title:
        title_text = title
    fig.update_layout(title_text=title_text, **main_title_settings)

    # Layout
    fig.update_layout(
        yaxis_title="Value",
        boxmode="group",
        height=1200 * 0.8,
        width=1750 * 0.8,
    )

    # Acc, F1
    fig.update_layout(yaxis=dict(range=[0.88, 1.001]))
    fig.update_layout(yaxis2=dict(range=[0.80, 1.001]))

    # AUC
    range_auc = [0.986, 1.0001]
    fig.update_layout(yaxis3=dict(range=range_auc))
    fig.update_layout(yaxis4=dict(range=range_auc))

    if y_range is not None:
        fig.update_yaxes(range=y_range)

    # Save figure
    if logdir:
        if name is None:
            name = "MLP_metrics_various_tasks"
        fig.write_image(logdir / f"{name}.svg")
        fig.write_image(logdir / f"{name}.png")
        fig.write_html(logdir / f"{name}.html")

    fig.show()

    return classifier_names

Compute metrics.

Code
exclude_categories = ["track_type", "group", "disease", "PE", "martin"]
# exclude_categories = ["track_type", "group", "disease"]
exclude_names = ["chip-seq", "7c", "16ct", "no-mixed"]

hdf5_type = "hg38_100kb_all_none"
results_dir = base_data_dir / "training_results" / "dfreeze_v2" / hdf5_type
PathChecker.check_directory(results_dir)

mislabel_correction = True
if mislabel_correction:
    mislabel_corrector = create_mislabel_corrector(paper_dir)
else:
    mislabel_corrector = None

split_results_metrics, all_split_results = split_results_handler.general_split_metrics(
    results_dir,
    merge_assays=True,
    exclude_categories=exclude_categories,
    exclude_names=exclude_names,
    return_type="both",
    oversampled_only=True,
    mislabel_corrections=mislabel_corrector,
    verbose=False,
)

Graph metrics.

Code
metrics_full = ["Accuracy", "F1_macro", "AUC_micro", "AUC_macro"]
fig_name = f"{hdf5_type}_perf_across_categories_full"
sorted_task_names = NN_performance_across_classification_tasks(
    split_results_metrics,  # type: ignore
    sort_by_acc=True,
    metric_names=metrics_full,
)

Fig. 2A,B: Distribution of accuracy and F1-score evaluated per training fold (dots) for each metadata classifier. Performance metrics are reported without applying a prediction score threshold. Dashed lines represent means, solid lines the medians, boxes the quartiles, and whiskers the farthest points within 1.5× the interquartile range.

C - Class imbalance (Shannon Entropy)

Define function compute_class_imbalance.

Code
def compute_class_imbalance(
    all_split_results: Dict[str, Dict[str, pd.DataFrame]],
) -> pd.DataFrame:
    """Compute class imbalance for each task and split.

    Args:
        all_split_results: A dictionary with structure {task_name: {split_name: split_results_df}}.

    Returns:
        pd.DataFrame: A DataFrame with the following columns:
            - avg(balance_ratio): The average balance ratio for each task.
            - n: The number of classes for each task (used for the average).
    """
    # combine md5 lists
    task_md5s = {
        classifier_task: [split_df.index for split_df in split_results.values()]
        for classifier_task, split_results in all_split_results.items()
    }
    task_md5s = {
        classifier_task: [list(split_md5s) for split_md5s in md5s]
        for classifier_task, md5s in task_md5s.items()
    }
    task_md5s = {
        classifier_task: list(itertools.chain(*md5s))
        for classifier_task, md5s in task_md5s.items()
    }

    # get metadata
    metadata_df = metadata_handler.load_metadata_df("v2-encode")

    label_counts = {}
    for classifier_task, md5s in task_md5s.items():
        try:
            label_counts[classifier_task] = metadata_df.loc[md5s][
                classifier_task
            ].value_counts()
        except KeyError:
            # Fall back to the base category name (e.g. "sex_1l" -> "sex")
            category_name = classifier_task.rsplit("_", maxsplit=1)[0]
            label_counts[classifier_task] = metadata_df.loc[md5s][
                category_name
            ].value_counts()

    # Compute Shannon Entropy
    class_balance = {}
    for classifier_task, counts in label_counts.items():
        total_count = counts.sum()
        k = len(counts)
        p_x = counts / total_count  # class proportions
        p_x = p_x.values
        shannon_entropy = -np.sum(p_x * np.log2(p_x))
        balance = shannon_entropy / np.log2(k)
        class_balance[classifier_task] = (balance, k)

    df_class_balance = pd.DataFrame.from_dict(
        class_balance, orient="index", columns=["Normalized Shannon Entropy", "k"]
    ).sort_index()

    return df_class_balance
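As a quick sanity check of the computation above, the approximate life-stage counts quoted in the article text (44 fetal, 106 newborn, ~5,000 adult; rounded, illustrative only) give a low normalized entropy, while perfectly balanced classes give 1:

```python
import numpy as np


def normalized_shannon_entropy(counts: np.ndarray) -> float:
    """Normalized Shannon entropy: 1 = perfectly balanced classes."""
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)) / np.log2(len(counts)))


print(normalized_shannon_entropy(np.array([44, 106, 5000])))  # strongly imbalanced, well below 1
print(normalized_shannon_entropy(np.array([100, 100, 100])))  # balanced, ≈ 1.0
```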

Compute class imbalance (Shannon entropy).

Code
subset = {
    k: v
    for k, v in all_split_results.items()  # type: ignore
    if not any(label in k for label in ["martin", "PE"])
}
df_class_balance = compute_class_imbalance(subset)  # type: ignore

Define graphing function plot_shannon_entropy.

Code
def plot_shannon_entropy(
    class_balance_df: pd.DataFrame, ordered_task_names: List[str] | None
) -> None:
    """Graph Shannon entropy values, in the order given by ordered_task_names."""
    df = class_balance_df.copy()

    # Reorder df (keep only tasks present in the dataframe)
    task_names = list(df.index)
    if ordered_task_names:
        task_names = [
            task_name for task_name in ordered_task_names if task_name in task_names
        ]
    df = df.loc[task_names]

    # plot
    fig = px.scatter(
        df,
        x=df.index,
        y="Normalized Shannon Entropy",
        labels={
            "k": "Number of classes",
            "Normalized Shannon Entropy": "Normalized Shannon Entropy",
        },
        title="Class imbalance across tasks (higher is more balanced)",
    )
    fig.update_layout(
        yaxis=dict(range=[0, 1]),
        xaxis_title=None,
    )

    fig.show()

Graph.

Code
plot_shannon_entropy(
    class_balance_df=df_class_balance,
    ordered_task_names=sorted_task_names,
    )

Fig. 2C: Shannon entropy scores for each metadata category.

D - Donor Sex classification

IHEC_sample_metadata_harmonization.v1.1.extended.csv contains 314 EpiRRs with unknown sex. We applied a fully trained sex classifier on those.

To properly create sex classification related graphs (D,E) we need

- Classifier predictions on samples with unknown sex label
- Metadata pre/post correction
- Average ChrY signal for each file

Load v1.1 and v1.2 official metadata.

Code
metadata_v1_1_path = (
    official_metadata_dir / "IHEC_sample_metadata_harmonization.v1.1.extended.csv"
)
metadata_v1_1 = pd.read_csv(metadata_v1_1_path, index_col=0)

metadata_v1_2_path = (
    official_metadata_dir / "IHEC_sample_metadata_harmonization.v1.2.extended.csv"
)
metadata_v1_2 = pd.read_csv(metadata_v1_2_path, index_col=0)

Sanity check: make sure that the number of unknown labels is the same in our metadata vs. the official v1.1 metadata.

Code
full_metadata_df = metadata_v2_df
full_metadata_df["md5sum"] = full_metadata_df.index
assert (
    metadata_v2_df[metadata_v2_df[SEX].isin(["unknown"])]["EpiRR"].nunique()
    == metadata_v1_1[metadata_v1_1[SEX] == "unknown"].index.nunique()
    == 314
)

Load predictions for unknown sex samples.

Code
sex_results_dir = (
    base_data_dir
    / "training_results"
    / "dfreeze_v2"
    / "hg38_100kb_all_none"
    / f"{SEX}_1l_3000n"
)
sex_full_model_dir = sex_results_dir / "complete_no_valid_oversample"
PathChecker.check_directory(sex_full_model_dir)

pred_unknown_file_path = (
    sex_full_model_dir
    / "predictions"
    / "complete_no_valid_oversample_test_prediction_100kb_all_none_dfreeze_v2.1_sex_mixed_unknown.csv"
)
pred_unknown_df = pd.read_csv(pred_unknown_file_path, index_col=0, header=0)

Join metadata to predictions.

Code
pred_unknown_df = pred_unknown_df[pred_unknown_df["True class"] == "unknown"]
pred_unknown_df = split_results_handler.add_max_pred(pred_unknown_df)  # type: ignore
pred_unknown_df = metadata_handler.join_metadata(pred_unknown_df, metadata_v2)
pred_unknown_df["md5sum"] = pred_unknown_df.index

Load 10-fold cross validation results.

Code
sex_10fold_dir = sex_results_dir / "10fold-oversampling"
PathChecker.check_directory(sex_10fold_dir)

split_results: Dict[str, pd.DataFrame] = split_results_handler.read_split_results(
    sex_10fold_dir
)
concat_results_10fold: pd.DataFrame = split_results_handler.concatenate_split_results(split_results, depth=1)  # type: ignore
concat_results_10fold = split_results_handler.add_max_pred(concat_results_10fold)
concat_results_10fold = metadata_handler.join_metadata(concat_results_10fold, metadata_v2)

Inner portion

Fig. 2D: Proportion of donor sex metadata originally annotated (inner circle, metadata v1.1) and predicted with high-confidence (outer circle, metadata v2.0) for female (red) and male (blue) (mixed sex not shown).

Compute values for the inner portion of the pie chart.

Code
# Proportion of unknown, excluding mixed. same as v1.1 ihec metadata
no_mixed = full_metadata_df[full_metadata_df[SEX] != "mixed"]

with pd.option_context("display.float_format", "{:.2%}".format):
    print("file-wise:")
    print(no_mixed[SEX].value_counts(dropna=False) / no_mixed.shape[0])

    print("\nEpiRR-wise:")
    epirr_no_mixed = no_mixed.drop_duplicates(subset=["EpiRR"])
    print(epirr_no_mixed[SEX].value_counts(dropna=False) / epirr_no_mixed.shape[0])
file-wise:
female    44.88%
male      42.41%
unknown   12.71%
Name: harmonized_donor_sex, dtype: float64

EpiRR-wise:
male      45.14%
female    40.60%
unknown   14.26%
Name: harmonized_donor_sex, dtype: float64

Outer portion

Outer ring represents SEX metadata labels v1.2 (without mixed labels), which had these modifications:

  • Some unknown SEX files were labelled, using the (assay, track type) z-score in conjunction with fully trained model predictions.
  • Correction of some mislabels, using 10-fold cross-validation results
Code
meta_v1_2_no_mixed = metadata_v1_2[metadata_v1_2[SEX] != "mixed"]
with pd.option_context("display.float_format", "{:.2%}".format):
    print("EpiRR-wise:")
    print(meta_v1_2_no_mixed.value_counts(SEX) / meta_v1_2_no_mixed.shape[0])
EpiRR-wise:
harmonized_donor_sex
male      51.42%
female    46.54%
unknown    2.04%
dtype: float64

Average chrY values z-score distributions

Work done in another script:

  1. For each bigwig file, the chrY average value is computed. (with pyBigWig module, in chrY_bigwig_mean.py)
  2. For each assay, the z-score distribution (of the mean chrY value) of the file group is computed.
    • Outputs chrXY_all.csv
    • Outputs chrY_zscores.csv

Fig. 2E is made by averaging, for each EpiRR, the z-score values of its files within their respective assay distributions.
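The two-step metric (per-assay z-score of the chrY mean, then per-EpiRR average) can be sketched with toy values (hypothetical numbers, not real chrY means):

```python
import pandas as pd
from scipy.stats import zscore

# Hypothetical per-file chrY mean signals
df = pd.DataFrame({
    "EpiRR": ["E1", "E1", "E2", "E2", "E3", "E3"],
    "assay": ["h3k4me3", "rna_seq"] * 3,
    "chrY": [4.1, 9.0, 0.5, 1.2, 4.3, 8.8],
})

# 1) z-score each file within its assay's distribution
df["chrY_z"] = df.groupby("assay")["chrY"].transform(zscore)

# 2) average the z-scores of each EpiRR's files (the Fig. 2E metric)
per_epirr = df.groupby("EpiRR")["chrY_z"].mean()
print(per_epirr)
```

Z-scoring within each assay first makes files from assays with very different chrY signal scales comparable before they are averaged per EpiRR.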

This data is needed for the full sex mislabel context table.

Define function compute_chrY_zscores.

Code
def compute_chrY_zscores(
    chrY_dir: Path, version: str, save: bool = False
) -> pd.DataFrame:
    """Compute z-scores for chrY signal data.

    Computes two distributions of z-scores:
    1) Per assay group, excluding raw, pval, and Unique_raw tracks.
    2) Per assay+track group.

    In both cases, rna-seq/mrna-seq and wgbs-standard/wgbs-pbat are put as one assay.

    Args:
        chrY_dir: The directory containing the chrY signal data.
        version: The metadata version to use.
        save: Whether to save the results.

    Returns:
        pd.DataFrame: The chrY signal data with z-scores appended.
    """
    output_dir = Path()
    if save:
        output_dir = chrY_dir / f"dfreeze_{version}_stats"
        output_dir.mkdir(parents=False, exist_ok=True)

    # Get chrY signal data (chrY_dir is provided by the caller)
    PathChecker.check_directory(chrY_dir)
    chrY_df = pd.read_csv(chrY_dir / "chrXY_all.csv", header=0)

    # Filter out md5s not in metadata version
    metadata = MetadataHandler(paper_dir).load_metadata(version)
    md5s = set(metadata.md5s)
    chrY_df = chrY_df[chrY_df["filename"].isin(md5s)]

    # Make sure all values are non-zero
    if not (chrY_df["chrY"] != 0).all():
        raise ValueError("Some chrY values are zero.")

    # Merge metadata
    metadata_df = pd.DataFrame.from_records(list(metadata.datasets))
    metadata_df.replace({ASSAY: ASSAY_MERGE_DICT}, inplace=True)
    chrY_df = chrY_df.merge(
        metadata_df, left_on="filename", right_on="md5sum"
    )

    # Compute stats for distributions
    metric_name_1 = "chrY_zscore_vs_assay_w_track_exclusion"
    metric_name_2 = "chrY_zscore_vs_assay_track"
    files1 = chrY_df[
        ~chrY_df["track_type"].isin(["raw", "pval", "Unique_raw"])
    ]
    files2 = chrY_df
    dist1 = files1.groupby(ASSAY).agg({"chrY": ["mean", "std", "count"]})
    dist2 = files2.groupby([ASSAY, "track_type"]).agg({"chrY": ["mean", "std", "count"]})
    if save:
        output_dir: Path
        dist1.to_csv(output_dir / "chrY_stats_assay_w_track_exclusion.csv")
        dist2.to_csv(output_dir / "chrY_stats_assay_and_track.csv")

    # Compute full z-score distributions
    for _, group_df in files1.groupby(ASSAY):
        chrY_df.loc[group_df.index, metric_name_1] = zscore(group_df["chrY"])
        chrY_df.loc[group_df.index, f"N_{metric_name_1}"] = group_df.shape[0]
    for _, group_df in files2.groupby([ASSAY, "track_type"]):
        chrY_df.loc[group_df.index, metric_name_2] = zscore(group_df["chrY"])
        chrY_df.loc[group_df.index, f"N_{metric_name_2}"] = group_df.shape[0]

    # Fill in missing values
    for N_name in [f"N_{metric_name_1}", f"N_{metric_name_2}"]:
        chrY_df[N_name] = chrY_df[N_name].fillna(0).astype(int)
    chrY_df.fillna(pd.NA, inplace=True)

    if save:
        output_cols = [
            "filename",
            ASSAY,
            "track_type",
            "chrY",
            metric_name_1,
            f"N_{metric_name_1}",
            metric_name_2,
            f"N_{metric_name_2}",
        ]
        chrY_df[output_cols].to_csv(
            output_dir / "chrY_zscores.csv", index=False, na_rep="NA"  # type: ignore
        )
    return chrY_df

Compute chrY zscores.

Code
chrY_dir = base_data_dir / "chrY"
PathChecker.check_directory(chrY_dir)

chrY_df = compute_chrY_zscores(chrY_dir, "v2", save=False)

Unknown predictions analysis file

The following folded code generates a table similar to the one used to determine which unknown sex samples to label. It was not used to produce any graph.

Code
def create_sex_pred_pivot_table(pred_unknown: pd.DataFrame, chrY_df: pd.DataFrame) -> pd.DataFrame:
    """Generate a pivot table containing group metrics per predicted label, for each EpiRR."""
    index_cols = [
        "EpiRR",
        "project",
        "harmonized_donor_type",
        CELL_TYPE,
        SEX,
        "Predicted class",
    ]
    pred_plus_chrY_df = pd.merge(
        pred_unknown,
        chrY_df,
        on="md5sum",
        suffixes=("", "_DROP")
    )
    pred_plus_chrY_df.drop(
        columns=[c for c in pred_plus_chrY_df.columns if c.endswith("_DROP")],
        inplace=True,
    )

    val_cols = ["Max pred", "chrY_zscore_vs_assay_track"]
    pivot_table = pred_plus_chrY_df.pivot_table(
        index=index_cols,
        values=val_cols,
        aggfunc=["mean", "median", "std", "count"],
    )

    return pivot_table


create_sex_pred_pivot_table(
    pred_unknown=pred_unknown_df,
    chrY_df=chrY_df,
)
mean median std count
Max pred chrY_zscore_vs_assay_track Max pred chrY_zscore_vs_assay_track Max pred chrY_zscore_vs_assay_track Max pred chrY_zscore_vs_assay_track
EpiRR project harmonized_donor_type harmonized_sample_ontology_intermediate harmonized_donor_sex Predicted class
IHECRE00000036.3 BLUEPRINT Single donor hematopoietic cell unknown male 0.781766 -0.081182 0.781766 -0.081182 NaN NaN 1 1
IHECRE00000042.3 BLUEPRINT Single donor hematopoietic cell unknown male 0.556080 -0.880438 0.556080 -0.880438 NaN NaN 1 1
IHECRE00000047.3 BLUEPRINT Single donor hematopoietic cell unknown female 0.769686 -0.876301 0.769686 -0.876301 NaN NaN 1 1
IHECRE00000067.3 BLUEPRINT Single donor hematopoietic cell unknown male 0.964230 0.620305 0.964230 0.620305 NaN NaN 1 1
IHECRE00000069.3 BLUEPRINT Single donor hematopoietic cell unknown male 0.978939 0.398611 0.978939 0.398611 NaN NaN 1 1
... ... ... ... ... ... ... ... ... ... ... ... ... ...
IHECRE00004904.1 NIH Roadmap Epigenomics Single donor neural progenitor cell unknown female 0.679335 -0.536291 0.679335 -0.536291 0.013960 0.113480 2 2
IHECRE00004908.1 NIH Roadmap Epigenomics Single donor placenta unknown female 0.590977 -0.458150 0.590977 -0.458150 NaN NaN 1 1
mixed 0.893086 -0.682532 0.893086 -0.682532 NaN NaN 1 1
IHECRE00004910.1 NIH Roadmap Epigenomics Single donor neural progenitor cell unknown female 0.753894 -0.288177 0.753894 -0.288177 NaN NaN 1 1
male 0.956916 0.402645 0.983646 -0.383976 0.065293 1.434576 5 5

409 rows × 8 columns


10fold cross-validation predictions analysis file

This section generates a similar table to what was used to determine which EpiRR sex labels might be mistaken. It aggregates results from the 10 cross-validation classifiers.

Code
cross_val_analysis = concat_results_10fold.merge(chrY_df, left_index=True, right_on="md5sum", suffixes=("", "_DROP"))  # type: ignore
cross_val_analysis.drop(
    columns=[c for c in cross_val_analysis.columns if c.endswith("_DROP")], inplace=True
)
Code
index_cols = [
    "EpiRR",
    "project",
    "harmonized_donor_type",
    CELL_TYPE,
    SEX,
    "Predicted class",
]
val_cols = ["Max pred", "chrY_zscore_vs_assay_track"]

# not directly used in full mislabel analysis
to_drop = [
    "N_chrY_zscore_vs_assay_w_track_exclusion",
    "chrY_zscore_vs_assay_w_track_exclusion",
]
cross_val_analysis_track = cross_val_analysis.drop(to_drop, axis=1)

pivot_table = cross_val_analysis_track.pivot_table(
    index=index_cols,
    values=val_cols,
    aggfunc=["mean", "median", "std", "count"],
)

display(pivot_table.head())
mean median std count
Max pred chrY_zscore_vs_assay_track Max pred chrY_zscore_vs_assay_track Max pred chrY_zscore_vs_assay_track Max pred chrY_zscore_vs_assay_track
EpiRR project harmonized_donor_type harmonized_sample_ontology_intermediate harmonized_donor_sex Predicted class
IHECRE00000001.4 CEEHRC Single donor epithelial cell derived cell line female female 0.964847 -0.649520 0.988930 -0.677161 0.063663 0.405152 20 20
mixed 0.610971 -0.378015 0.570102 -0.568708 0.137748 0.653722 3 3
IHECRE00000002.3 BLUEPRINT Single donor myeloid cell female female 0.963702 -0.127269 0.996254 -0.554242 0.096154 1.082720 20 20
IHECRE00000004.3 BLUEPRINT Single donor neutrophil female female 0.833395 -0.040080 0.900267 -0.542614 0.177544 1.395165 18 18
male 0.651440 0.404992 0.651231 -0.627901 0.115929 2.277509 5 5

E - Average EpiRR z-score for 10fold

Define function zscore_merged_assays that computes the chrY average signal metric.

Code
def zscore_merged_assays(
    zscore_df: pd.DataFrame,
    sex_mislabels: Dict[str, str],
    name: str | None = None,
    logdir: Path | None = None,
    min_pred: float | None = None,
    no_rna: bool = False,
) -> None:
    """Male vs Female z-score distribution for merged assays, excluding wgbs.

    Does not include pval and raw tracks.

    Highlights mislabels in the plot.

    Args:
        zscore_df (pd.DataFrame): The dataframe with z-score data.
        sex_mislabels (Dict[str, str]): {EpiRR_no-v: corrected_sex_label}
        logdir (Path): The directory path to save the output plots. If None, only display the plot.
        name (str): The base name for the output plot files.
        min_pred (float|None): Minimum prediction value to include in the plot. Used on average EpiRR 'Max pred' values.
        no_rna (bool): Whether to exclude rna_seq from the plot.
    """
    zscore_df = zscore_df.copy(deep=True)

    # Remove pval/raw tracks + rna unstranded
    zscore_df = zscore_df[~zscore_df["track_type"].isin(["pval", "raw", "Unique_raw"])]

    # Merge rna protocols
    zscore_df.replace({ASSAY: ASSAY_MERGE_DICT}, inplace=True)

    # wgbs reverses male/female chrY tendency, so removed here
    zscore_df = zscore_df[~zscore_df[ASSAY].str.contains("wgb")]

    if no_rna:
        zscore_df = zscore_df[~zscore_df[ASSAY].str.contains("rna")]

    N_assays = len(zscore_df[ASSAY].unique())
    print(
        f"Average chrY z-score values computed from:\n{zscore_df[ASSAY].value_counts(dropna=False)}"
    )

    # Average chrY z-score values
    metric_label = "chrY_zscore_vs_assay_w_track_exclusion"
    zscore_df = zscore_df[zscore_df[metric_label] != "NA"]
    mean_chrY_values_df = zscore_df.groupby(["EpiRR", SEX]).agg(
        {metric_label: "mean", "Max pred": "mean"}
    )
    mean_chrY_values_df.reset_index(inplace=True)
    if not mean_chrY_values_df["EpiRR"].is_unique:
        raise ValueError("EpiRR is not unique.")

    # Filter out low prediction values
    if min_pred is not None:
        mean_chrY_values_df = mean_chrY_values_df[
            mean_chrY_values_df["Max pred"] > min_pred
        ]

    mean_chrY_values_df.reset_index(drop=True, inplace=True)
    mean_chrY_values_df["EpiRR_no_v"] = mean_chrY_values_df["EpiRR"].str.extract(
        pat=r"(\w+\d+)\.\d+"  # strip the EpiRR version suffix, e.g. "IHECRE00000001.4" -> "IHECRE00000001"
    )[0]

    chrY_values = mean_chrY_values_df[metric_label]
    female_idx = np.argwhere((mean_chrY_values_df[SEX] == "female").values).flatten()  # type: ignore
    male_idx = np.argwhere((mean_chrY_values_df[SEX] == "male").values).flatten()  # type: ignore

    # Mislabels
    predicted_as_female = set(
        epirr_no_v for epirr_no_v, label in sex_mislabels.items() if label == "female"
    )
    predicted_as_male = set(
        epirr_no_v for epirr_no_v, label in sex_mislabels.items() if label == "male"
    )
    predicted_as_female_idx = np.argwhere(mean_chrY_values_df["EpiRR_no_v"].isin(predicted_as_female).values).flatten()  # type: ignore
    predicted_as_male_idx = np.argwhere(mean_chrY_values_df["EpiRR_no_v"].isin(predicted_as_male).values).flatten()  # type: ignore

    print(
        f"Adding mislabels to graph: {len(predicted_as_female_idx)} male->female, {len(predicted_as_male_idx)} female->male"
    )

    # Hovertext
    hovertext = [
        f"{epirr}: <z-score>={z_score:.3f}"
        for epirr, z_score in zip(
            mean_chrY_values_df["EpiRR"],
            mean_chrY_values_df[metric_label],
        )
    ]
    hovertext = np.array(hovertext)

    # sanity check
    if mean_chrY_values_df["EpiRR"].nunique() != mean_chrY_values_df.shape[0]:
        raise ValueError("EpiRR is not unique.")

    # Create figure
    fig = go.Figure()
    fig.add_trace(
        go.Box(
            name="Female",
            x=[0] * len(female_idx),
            y=chrY_values[female_idx],  # type: ignore
            boxmean=True,
            boxpoints="all",
            pointpos=-2,
            hovertemplate="%{text}",
            text=hovertext[female_idx],
            marker=dict(size=2),
            line=dict(width=1, color="black"),
            fillcolor=sex_colors["female"],
            showlegend=True,
        ),
    )

    fig.add_trace(
        go.Box(
            name="Male",
            x=[1] * len(male_idx),
            y=chrY_values[male_idx],  # type: ignore
            boxmean=True,
            boxpoints="all",
            pointpos=-2,
            hovertemplate="%{text}",
            text=hovertext[male_idx],
            marker=dict(size=2),
            line=dict(width=1, color="black"),
            fillcolor=sex_colors["male"],
            showlegend=True,
        ),
    )

    fig.add_trace(
        go.Scatter(
            name="Male",
            x=[-0.5] * len(predicted_as_male_idx),
            y=chrY_values[predicted_as_male_idx],  # type: ignore
            mode="markers",
            marker=dict(
                size=10, color=sex_colors["male"], line=dict(width=1, color="black")
            ),
            hovertemplate="%{text}",
            text=hovertext[predicted_as_male_idx],
            showlegend=False,
        ),
    )

    fig.add_trace(
        go.Scatter(
            name="Female",
            x=[0.5] * len(predicted_as_female_idx),
            y=chrY_values[predicted_as_female_idx],  # type: ignore
            mode="markers",
            marker=dict(
                size=10, color=sex_colors["female"], line=dict(width=1, color="black")
            ),
            hovertemplate="%{text}",
            text=hovertext[predicted_as_female_idx],
            showlegend=False,
        ),
    )

    fig.update_xaxes(showticklabels=False)

    fig.update_yaxes(range=[-1.5, 3])
    title = f"z-score(mean chrY signal per file) distribution - z-scores averaged over {N_assays} assays"
    if min_pred is not None:
        title += f"<br>avg_maxPred>{min_pred}"

    fig.update_layout(
        title=dict(text=title, x=0.5),
        xaxis_title=SEX,
        yaxis_title="Average z-score",
        width=750,
        height=750,
    )

    # Save figure
    if logdir:
        this_name = f"{metric_label}_n{mean_chrY_values_df.shape[0]}"
        if name:
            this_name = f"{name}_{this_name}"
        fig.write_image(logdir / f"{this_name}.svg")
        fig.write_image(logdir / f"{this_name}.png")
        fig.write_html(logdir / f"{this_name}.html")

    fig.show()
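The `chrY_zscore_vs_assay_w_track_exclusion` column consumed above is precomputed upstream, not derived in this notebook. A plausible reading is that each file's mean chrY signal is standardized against the distribution of files sharing the same assay and track type; a minimal sketch under that assumption (toy data, hypothetical column names):

```python
import pandas as pd

def zscore_within_group(df: pd.DataFrame, value_col: str, group_cols: list[str]) -> pd.Series:
    """Standardize value_col against the mean/std of its (assay, track_type) group."""
    grouped = df.groupby(group_cols)[value_col]
    return (df[value_col] - grouped.transform("mean")) / grouped.transform("std")

toy = pd.DataFrame(
    {
        "assay": ["h3k4me3"] * 4,
        "track_type": ["fc"] * 4,
        "mean_chrY_signal": [1.0, 2.0, 3.0, 4.0],
    }
)
# z-scores within each (assay, track_type) group are centered on zero
toy["chrY_zscore"] = zscore_within_group(toy, "mean_chrY_signal", ["assay", "track_type"])
```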

Graph zscores.

Code
zscore_merged_assays(
    zscore_df=cross_val_analysis,
    sex_mislabels=create_mislabel_corrector(paper_dir)[1][SEX],
    name="no_RNA",
    no_rna=True,
)
Average chrY z-score values computed from:
h3k27ac     1447
h3k4me1      850
h3k4me3      678
input        657
h3k36me3     595
h3k27me3     575
h3k9me3      537
Name: assay_epiclass, dtype: int64
Adding mislabels to graph: 8 male->female, 8 female->male

Fig 2E: Distribution of the average z-score signal of epigenomes (dots) over chrY, computed on the ChIP-Seq datasets (up to 7 assays per epigenome) using the fold change track type files for female (red) and male (blue). Originally mislabeled epigenomes are shown as larger dots. Boxplot elements are as in Fig. 2A.

F - Donor life stage and GP-Age

Load official IHEC sample metadata.

Code
meta_v1_2_df = pd.read_csv(
    official_metadata_dir / "IHEC_sample_metadata_harmonization.v1.2.extended.csv"
)
meta_v1_1_df = pd.read_csv(
    official_metadata_dir / "IHEC_sample_metadata_harmonization.v1.1.extended.csv"
)

Sanity check: both metadata versions have the same ‘cell line’ samples.

Code
cell_lines_v11 = meta_v1_1_df[meta_v1_1_df[BIOMATERIAL_TYPE] == "cell line"][
    "epirr_id_without_version"
].unique()
cell_lines_v12 = meta_v1_2_df[meta_v1_2_df[BIOMATERIAL_TYPE] == "cell line"][
    "epirr_id_without_version"
].unique()
assert set(cell_lines_v11) == set(cell_lines_v12)

Load GP-age results.

Code
gp_age_dir = base_data_dir / "GP_age"
PathChecker.check_directory(gp_age_dir)

df_gp_age = pd.read_csv(gp_age_dir / "life_stage_prediction.GPage.20250513.tsv", sep="\t")
df_gp_age["graph_age"] = df_gp_age["model30"]

epiclass_pred_col = "epiclass_predicted_lifestage"

display(df_gp_age.head())
epirr_noV version model30 model71 model30_imputed_data model71_imputed_data nbCpG_model30 nbCpG_model71 harmonized_donor_life_stage harmonized_donor_age harmonized_donor_age_unit epiclass_predicted_lifestage tissue tissue_subtype graph_age
0 IHECRE00000001 4 39.356 28.973 71.782 62.627 30 71 unknown unknown unknown adult_pred REPR breast 39.356
1 IHECRE00000004 3 49.972 46.109 53.737 49.893 30 70 adult 60-65 year adult IMMU blood-venous 49.972
2 IHECRE00000008 3 55.681 51.238 40.128 36.052 28 67 adult 50-55 year adult IMMU blood-venous 55.681
3 IHECRE00000013 3 61.915 64.748 35.262 32.777 29 67 child 0-5 year adult_pred IMMU blood-venous 61.915
4 IHECRE00000014 2 48.480 49.848 71.979 64.951 29 67 adult 70-75 year adult IMMU bone-marrow 48.480

The epiclass_predicted_lifestage column is manually curated from the per-sample analysis of all predictions. When an expected label is known but the EpiClass predictions are inconclusive (low average max pred and/or no majority consensus), the expected label is retained.
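A sketch of the kind of consensus rule described above — purely illustrative, since the actual curation was manual; the threshold and function name are assumptions:

```python
from collections import Counter

def curate_life_stage(predictions, scores, expected=None, min_avg_score=0.6):
    """Keep the majority prediction if it is confident; otherwise fall back on
    the expected label when one is known (illustrative thresholding only)."""
    label, votes = Counter(predictions).most_common(1)[0]
    has_majority = votes > len(predictions) / 2
    avg_score = sum(scores) / len(scores)
    if has_majority and avg_score >= min_avg_score:
        return f"{label}_pred"
    # Inconclusive: keep the expected label when one exists
    return expected if expected is not None else "unknown"

print(curate_life_stage(["adult", "adult", "adult"], [0.9, 0.8, 0.95]))     # adult_pred
print(curate_life_stage(["adult", "child"], [0.5, 0.5], expected="child"))  # child
```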

Columns tissue and tissue_subtype are reformattings of harmonized_sample_organ_system_order_AnetaMikulasova and harmonized_sample_organ_order_AnetaMikulasova.
The following code remaps the five life stage classes to three (perinatal, pediatric, adult), leaving ‘unknown’ unchanged.

Code
df_gp_age.loc[:, "graph_age_categories"] = df_gp_age[epiclass_pred_col].str.removesuffix(
    "_pred"
)

gp_age_categories = {
    "adult": "adult",
    "child": "pediatric",
    "embryonic": "perinatal",
    "fetal": "perinatal",
    "newborn": "perinatal",
    "unknown": "unknown",
}
df_gp_age.loc[:, "graph_age_categories"] = df_gp_age["graph_age_categories"].map(
    gp_age_categories
)
display(df_gp_age["graph_age_categories"].value_counts(dropna=False))
adult        479
pediatric     54
perinatal     53
unknown       48
Name: graph_age_categories, dtype: int64

Here we merge GP-age data with additional EpiAtlas metadata.

Code
epirr_col = "epirr_id_without_version"

merged_dp_age = pd.merge(
    df_gp_age,
    meta_v1_2_df[[epirr_col, CELL_TYPE, BIOMATERIAL_TYPE]],
    left_on="epirr_noV",
    right_on=epirr_col,
    how="left",
)
merged_dp_age.drop_duplicates(subset=[epirr_col], inplace=True)
merged_dp_age.drop(columns=["epirr_noV"], inplace=True)

Excluding ‘cell line’ samples from considered data.

Code
# Removing cell lines: life stage makes less sense
# type: ignore
N_before = merged_dp_age.shape[0]
merged_dp_age: pd.DataFrame = merged_dp_age[
    merged_dp_age[BIOMATERIAL_TYPE] != "cell line"
]
print(f"Removed {N_before - merged_dp_age.shape[0]} cell lines.")
Removed 29 cell lines.

Creating the graph categories.

Whole blood needs to be categorized separately from other tissues, since the GP-age training set consists exclusively of whole blood. Among ‘immune system’ tissues, we keep venous and umbilical cord blood, as they match the training data most closely; we are unsure about bone marrow.

Code
layer1_vals = ["IMMU"]
layer2_vals = ["blood-umbilical-cord", "blood-venous"]

merged_dp_age.loc[:, "tissue_group"] = [
    "blood" if (val1 in layer1_vals and val2 in layer2_vals) else "other"
    for val1, val2 in merged_dp_age.loc[:, ["tissue", "tissue_subtype"]].values
]

Sanity check: no NaNs present.

Code
important_cols = [
    "epirr_id_without_version",
    "tissue_group",
    epiclass_pred_col,
    "graph_age",
]
missing_N = merged_dp_age.loc[:, important_cols].isna().sum().sum()
if missing_N > 0:
    raise ValueError(f"Missing values in merged_dp_age: {missing_N}")

Define graphing function graph_gp_age.

Code
def graph_gp_age(
    df_gp_age: pd.DataFrame,
    logdir: Path | None = None,
    name: str | None = None,
    title: str | None = None,
) -> None:
    """
    Plot the GP age predictions.

    Args:
        df_gp_age: The dataframe with GP age data.
    """
    df = df_gp_age.copy(deep=True)

    tissue_colors = {"blood": "red", "other": "gray"}

    age_cat_label = "graph_age_categories"

    fig = go.Figure()
    for tissue_group in sorted(df["tissue_group"].unique()):
        sub_df = df[df["tissue_group"] == tissue_group]
        fig.add_trace(
            go.Box(
                name=f"{tissue_group} (n={len(sub_df)})",
                x=sub_df[age_cat_label],
                y=sub_df["graph_age"],
                boxmean=True,
                boxpoints="all",
                hovertemplate="%{text}",
                text=[
                    f"{ct}: {age:.3f}"
                    for ct, age in zip(sub_df[CELL_TYPE], sub_df["graph_age"])
                ],
                marker=dict(size=2, color=tissue_colors[tissue_group]),
                showlegend=True,
            ),
        )

    fig.update_layout(
        title=f"GP age predictions - Using MLP predicted labels ({title})",
        xaxis_title="Life stage",
        yaxis_title="GP-Age : Predicted age",
        width=750,
        height=750,
        boxmode="group",
    )

    # Order x-axis
    label_order = ["perinatal", "pediatric", "adult"]
    axis_labels = [
        f"{age_cat} (n={(df[age_cat_label] == age_cat).sum()})" for age_cat in label_order
    ]

    fig.update_xaxes(categoryorder="array", categoryarray=label_order)
    fig.update_xaxes(tickvals=[0, 1, 2], ticktext=axis_labels)

    # Save figure
    if logdir:
        if name is None:
            name = "GP_age_predictions"
        fig.write_image(logdir / f"{name}.svg")
        fig.write_image(logdir / f"{name}.png")
        fig.write_html(logdir / f"{name}.html")

    fig.show()

Define summary function display_tissue_age_df, which prints the tissue group count breakdown.

Code
def display_tissue_age_df(df: pd.DataFrame) -> None:
    """Display tissue group count breakdown."""
    count = df.groupby(["tissue_group", "graph_age_categories"]).size()
    print("Tissue group summary")
    print(f"Blood: {count.loc['blood'].sum()}")  # type: ignore
    print(f"Other: {count.loc['other'].sum()}")  # type: ignore
    print(f"Total: {count.sum()}")
    print("Detail:")
    print(count)

Restrict the analysis to samples whose original life stage label was unknown, first removing inconclusive (‘unknown’) EpiClass predictions.

Code
# Remove inconclusive predictions
gp_graph_df = merged_dp_age[merged_dp_age[epiclass_pred_col] != "unknown"]

only_unknown_df: pd.DataFrame = gp_graph_df[gp_graph_df[LIFE_STAGE] == "unknown"]
display_tissue_age_df(only_unknown_df)

graph_gp_age(
    df_gp_age=only_unknown_df,
    name="GP_age_predictions_unknown_only_MLP_predicted_labels",
    title="Samples with unknown life stage",
)
Tissue group summary
Blood: 82
Other: 63
Total: 145
Detail:
tissue_group  graph_age_categories
blood         adult                   76
              pediatric                3
              perinatal                3
other         adult                   52
              pediatric                3
              perinatal                8
dtype: int64

Fig 2F: Distribution of the age predictions from GP-age for the WGBS datasets with an originally unknown life stage, as predicted by EpiClass (datasets related to blood biospecimens are in red, others in grey). Boxplot elements are as in Fig. 2A.

G - Biospecimen classifier - ChromScore for high-SHAP regions

See src/python/epiclass/utils/notebooks/analyze_hdf5_vals.ipynb, particularly section “ChromScore hdf5 values”.

Fig 2G: Distribution of the average ChromScore values over the important Biospecimen classifier regions according to SHAP (pink) compared to the global distribution (grey). Statistical significance was assessed using a two-sided Welch’s t-test. Boxplot elements are as in Fig. 2A, with a violin representation on top.
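The two-sided Welch's t-test mentioned in the Fig. 2G caption can be reproduced with `scipy.stats.ttest_ind` and `equal_var=False`; the data below is synthetic and purely illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-ins: average ChromScore over SHAP-important regions vs globally
shap_region_scores = rng.normal(loc=0.6, scale=0.1, size=300)
global_scores = rng.normal(loc=0.5, scale=0.15, size=3000)

# Welch's t-test: two-sided, unequal variances and sample sizes
t_stat, p_value = stats.ttest_ind(shap_region_scores, global_scores, equal_var=False)
print(f"t={t_stat:.2f}, p={p_value:.3g}")
```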

H - Gene Ontology and SHAP values

See src/python/epiclass/utils/notebooks/profile_bed.ipynb for the creation of the gene ontology files. The code intersects the SHAP-important regions of different cell types with the gene GFF annotation and runs enrichment through the gProfiler module.

Define GO terms.

Code
selected_cell_types = [
    "T_cell",
    "neutrophil",
    "lymphocyte_of_B_lineage",
    "brain",
    "hepatocyte",
]
go_terms_table = [
    "T cell receptor complex",
    "plasma membrane signaling receptor complex",
    "adaptive immune response",
    "receptor complex",
    "secretory granule",
    "secretory vesicle",
    "secretory granule membrane",
    "intracellular vesicle",
    "immunoglobulin complex",
    "immune response",
    "immune system process",
    "homophilic cell adhesion via plasma membrane adhesion molecules",
    "DNA binding",
    "cell-cell adhesion via plasma-membrane adhesion molecules",
    "RNA polymerase II cis-regulatory region sequence-specific DNA binding",
    "blood microparticle",
    "platelet alpha granule lumen",
    "fibrinogen complex",
    "endoplasmic reticulum lumen",
]

# Add some line breaks for readability
go_terms_graph = [
    "T cell receptor complex",
    "plasma membrane<br>signaling receptor complex",
    "adaptive immune response",
    "receptor complex",
    "secretory granule",
    "secretory vesicle",
    "secretory granule membrane",
    "intracellular vesicle",
    "immunoglobulin complex",
    "immune response",
    "immune system process",
    "homophilic cell adhesion via<br>plasma membrane adhesion molecules",
    "DNA binding",
    "cell-cell adhesion via<br>plasma-membrane adhesion molecules",
    "RNA polymerase II cis-regulatory<br>region sequence-specific DNA binding",
    "blood microparticle",
    "platelet alpha granule lumen",
    "fibrinogen complex",
    "endoplasmic reticulum lumen",
]

Setup paths.

Code
hdf5_type = "hg38_100kb_all_none"

SHAP_dir = base_data_dir / "SHAP"

cell_type_shap_dir = (
    SHAP_dir / hdf5_type / f"{CELL_TYPE}_1l_3000n" / "10fold-oversampling"
)
beds_file = cell_type_shap_dir / "select_beds_top303.tar.gz"
PathChecker.check_file(beds_file)

Load top g:profiler results for each cell type.

Code
all_go_dfs: Dict[str, pd.DataFrame] = {}
with tarfile.open(beds_file, "r:gz") as tar:
    for member in tar.getmembers():
        filename = member.name
        if filename.endswith("profiler.tsv") and "merge_samplings" in filename:
            with tar.extractfile(member) as f:  # type: ignore
                go_df = pd.read_csv(f, sep="\t", index_col=0)
                all_go_dfs[member.name] = go_df

assert len(all_go_dfs) == 16

Merge results for all cell types into one table.

Code
for name, df in all_go_dfs.items():
    sub_df = df.copy()
    sub_df.loc[:, "shap_source"] = re.match(r".*/merge_samplings_(.*)_features_intersect_gff_gprofiler.tsv", name).group(1)  # type: ignore
    sub_df.loc[:, "table_val"] = -np.log10(sub_df.loc[:, "p_value"])
    all_go_dfs[name] = sub_df

full_concat_df = pd.concat(all_go_dfs.values())
full_concat_df = full_concat_df.drop(["significant", "query"], axis=1)

assert all(go_term in full_concat_df["name"].values for go_term in go_terms_table)

table = full_concat_df.pivot_table(
    index="name", columns="shap_source", values="table_val", aggfunc="mean"
)

Collect subset of table for the five selected cell types.

Code
table_5_ct = table.loc[go_terms_table, selected_cell_types].copy()
assert table_5_ct.shape == (len(go_terms_graph), len(selected_cell_types))

# Rename index
table_5_ct = pd.DataFrame(table_5_ct, index=go_terms_graph, columns=table_5_ct.columns)

Define graphing function plot_go_heatmap.

Code
def plot_go_heatmap(table: pd.DataFrame, width: int, height: int, title: str|None=None):
    """Plot a GO heatmap."""
    # Define colorbar
    sigma = "\u03c3"

    colorbar = dict(
        title="-log<sub>10</sub>(p-value)",
        tickvals=[0, 1.30, 2, 3, 5, 6.53, 10],
        ticktext=[
            "0",
            f"1.30: p=0.05~2{sigma}",
            "2: p=0.01",
            f"3: p=0.001~3{sigma}",
            "5",
            f"6.53: p=3x10<sup>-7</sup> = 5{sigma}",
            "10",
        ],
    )

    # keep NaNs in for labeling
    z = table.values
    text = np.where(np.isnan(z), "NS", np.char.mod("%.2f", z))

    fig = go.Figure(
        data=go.Heatmap(
            z=np.nan_to_num(z, nan=0),  # replace NaN with 0 for coloring
            x=table.columns,
            y=table.index,
            colorscale="Blues",
            zmin=0,
            zmax=10,
            colorbar=colorbar,
            text=text,
            texttemplate="%{text}",
            hovertemplate="GO Term: %{y}<br>Class: %{x}<br>Value: %{z:.2f}<extra></extra>",
            showscale=True,
            xgap=2,
            ygap=2,
        )
    )

    fig.update_layout(
        title_text=title,
        width=width,
        height=height,
        plot_bgcolor="black",
    )
    fig.update_xaxes(showgrid=False)
    fig.update_yaxes(showgrid=False)

    fig.show()
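The σ equivalences in the colorbar above mix conventions: p=0.05 corresponds to ~2σ two-sided, while the 6.53 tick uses the one-sided 5σ convention (p≈3×10⁻⁷). They can be checked against the normal distribution:

```python
import numpy as np
from scipy.stats import norm

print(-np.log10(0.05))        # ~1.30, the p=0.05 tick
print(norm.isf(0.05 / 2))     # ~1.96 sigma, i.e. p=0.05 two-sided is ~2 sigma
print(2 * norm.sf(3))         # ~0.0027, two-sided p at 3 sigma (close to p=0.001)
print(-np.log10(norm.sf(5)))  # ~6.54, one-sided 5 sigma (p ~ 2.9e-7)
```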

Graph GO for 5 cell types.

Code
plot_go_heatmap(
    table=table_5_ct,
    width=600,
    height=1000,
)

Fig 2H: Top four gene ontology (GO) terms enriched by g:Profiler[35] for genes within important Biospecimen classifier regions according to SHAP values.

Top 5 p-value terms for each biospecimen label

Collect the 5 most significant terms for each of the 16 cell types.

Code
# preserve order
top_5_terms = []
for name, df in all_go_dfs.items():
    df.sort_values("p_value", inplace=True)
    top_5 = df["name"].head(5).to_list()
    top_5_terms.extend(top_5)

top_5_terms = list(dict.fromkeys(top_5_terms))
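`dict.fromkeys` is used above because, unlike `set`, it deduplicates while preserving first-occurrence order (dict insertion order is guaranteed since Python 3.7):

```python
# Order-preserving deduplication, as used for the GO term list
terms = ["immune response", "receptor complex", "immune response", "DNA binding"]
deduped = list(dict.fromkeys(terms))
print(deduped)  # ['immune response', 'receptor complex', 'DNA binding']
```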

Graph.

Code
table_top_5_GO = table.loc[top_5_terms, :].copy()

plot_go_heatmap(
    table=table_top_5_GO,
    width=1300,
    height=2000,
)

I - Biospecimen important regions chromatin states

Images extracted from the Epilogos viewer, using:

- Region: chr11:118300000-118400000
- View mode: Single
- Dataset: IHEC - All biosamples
- Saliency Metric: S1

Fig. 2I: Epilogos visualization of one of the important Biospecimen classifier regions enriched in the T cell receptor complex GO term from Fig. 2H.

J - Copy Number Alteration (CNA) Signatures and SHAP values

Note: The code uses the acronym CNV (Copy Number Variant/Variation) instead of CNA.

See CNV_treatment.ipynb for the creation of the CNA stats. The following figures represent the z-scores of the signal within the top SHAP features (N=336) vs 200 random feature sets of the same size, for signatures created using patient samples with cancer types similar to the EpiAtlas content (see paper Methods).
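A minimal sketch of this kind of enrichment z-score — the observed signal over the top SHAP feature set compared against a null distribution built from random same-size sets; all names and data below are hypothetical, not the actual CNV_treatment.ipynb pipeline:

```python
import numpy as np

rng = np.random.default_rng(42)
signature_signal = rng.random(30321)  # one CN signature over genome-wide 100kb bins
top_shap_idx = rng.choice(signature_signal.size, size=336, replace=False)

# Observed mean signal over the SHAP-important bins
observed = signature_signal[top_shap_idx].mean()

# Null distribution: mean signal over 200 random feature sets of the same size
null = np.array(
    [
        signature_signal[rng.choice(signature_signal.size, 336, replace=False)].mean()
        for _ in range(200)
    ]
)
z_score = (observed - null.mean()) / null.std()
```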

Load z-scores.

Code
cnv_dir = base_data_dir / "CNV"
cnv_intersection_results = (
    cnv_dir
    / "signature_analysis"
    / "epiatlas_cancer_types"
    / "important_cancer_features_z_scores_vs_random200.tsv"
)
PathChecker.check_file(cnv_intersection_results)

cnv_df = pd.read_csv(cnv_intersection_results, sep="\t", index_col=0)
cnv_df.name = cnv_intersection_results.stem

Define graphing function plot_cnv_zscores, which uses the CNA groups defined in “Signatures of copy number alterations in human cancer”. See Supplementary Table 8 for details.

Code
def plot_cnv_zscores(cnv_df: pd.DataFrame, logdir: Path | None = None) -> None:
    """Plot z-scores of top SHAP features vs random feature sets, grouped by CNV groups.

    Args:
        cnv_df: The DataFrame with z-scores.
        logdir: The output directory to save the plot.
    """
    n_beds = int(cnv_df.name.split("random")[1])
    signature_subset_name = "EpiATLAS cancer types"

    CN_groups = [
        [f"CN{i}" for i in range(1, 4)],
        [f"CN{i}" for i in range(9, 13)],
        [f"CN{i}" for i in range(13, 17)],
        [f"CN{i}" for i in range(17, 18)],
        [f"CN{i}" for i in range(18, 22)],
        [f"CN{i}" for i in range(4, 9)],
    ]
    CN_names = [
        "CN1-CN3",
        "CN9-CN12",
        "CN13-CN16",
        "CN17",
        "CN18-CN21",
        "CN4-CN8",
    ]

    # Assign groups to the DataFrame
    cnv_df["group"] = "Other"
    for i, group in enumerate(CN_groups):
        cnv_df.loc[cnv_df.index.isin(group), "group"] = CN_names[i]

    # Sort groups
    group_medians = (
        cnv_df.groupby("group")["z_score"].median().sort_values(ascending=False)
    )
    sorted_CN_names = group_medians.index.tolist()

    # Create the figure
    fig = go.Figure()

    for group in sorted_CN_names:
        group_data = cnv_df[cnv_df["group"] == group]
        marker_size = 4 if group != "CN17" else 6

        # Add the box plot without points
        fig.add_trace(
            go.Box(
                y=group_data["z_score"],
                name=group,
                boxmean=True,
                boxpoints=False,  # Don't show points in the box plot
                line=dict(color="black"),
                fillcolor="rgba(255,255,255,0)",
                showlegend=False,
            )
        )

        # Add scatter plot for individual points
        fig.add_trace(
            go.Scatter(
                x=[group] * len(group_data),
                y=group_data["z_score"],
                mode="markers",
                marker=dict(
                    color="red",
                    size=marker_size,
                ),
                name=group,
                showlegend=False,
                text=group_data.index,  # Use CN names as hover text
                hoverinfo="text+y",  # Show CN name and y-value on hover
            )
        )
    # Update layout
    fig.update_layout(
        xaxis_title="CNA Group",
        yaxis_title="Z-score",
        **main_title_settings
    )

    # Add a horizontal line at y=0 for reference
    fig.add_hline(y=0, line_color="grey", line_width=1)

    # Show and save the figure
    if logdir:
        name = "important_cancer_features_z_scores_boxplot"
        fig.write_image(logdir / f"{name}.png")
        fig.write_image(logdir / f"{name}.svg")
        fig.write_html(logdir / f"{name}.html")

    fig.show()

Graph.

Code
plot_cnv_zscores(cnv_df)

Define the graphing function plot_cnv_zscores_alt, which groups CNA by focal vs chromosomal events.

Code
def plot_cnv_zscores_alt(cnv_df: pd.DataFrame, logdir: Path | None = None) -> None:
    """Plot z-scores of top SHAP features vs random feature sets, grouped by focal vs chromosomal events.

    Args:
        cnv_df: The DataFrame with z-scores.
        logdir: The output directory to save the plot.
    """
    n_beds = int(cnv_df.name.split("random")[1])
    signature_subset_name = "EpiATLAS cancer types"

    # Assign groups
    CN_groups = [
        [f"CN{i}" for i in range(1, 9)] + [f"CN{i}" for i in range(13, 17)],
        [f"CN{i}" for i in range(9, 13)] + [f"CN{i}" for i in range(17, 22)],
    ]
    CN_names = [
        "Chromosomal events (CN1-CN8, CN13-CN16)",
        "Focal events (CN9-CN12, CN17-CN21)",
    ]

    # Assign groups to the DataFrame
    cnv_df["group"] = "Other"
    for i, group in enumerate(CN_groups):
        cnv_df.loc[cnv_df.index.isin(group), "group"] = CN_names[i]

    # Sort groups
    group_medians = (
        cnv_df.groupby("group")["z_score"].median().sort_values(ascending=False)
    )
    sorted_CN_names = group_medians.index.tolist()

    # Create the figure
    fig = go.Figure()

    for group in sorted_CN_names:
        group_data = cnv_df[cnv_df["group"] == group]
        hover_text = [
            f"{CN_name}: Z={val:.3f}"
            for CN_name, val in zip(group_data.index, group_data["z_score"])
        ]

        # Add the box plot without points
        fig.add_trace(
            go.Box(
                y=group_data["z_score"],
                name=group,
                boxmean=True,
                boxpoints="all",
                line=dict(color="black"),
                fillcolor="rgba(255,255,255,0)",
                showlegend=False,
                hoverinfo="text",  # Show CN name and y-value on hover
                hovertext=hover_text,
            )
        )

    # Update layout
    fig.update_layout(
        xaxis_title="Copy Number Alteration Event Type",
        yaxis_title="CNA Enrichment (Z-score)",
        **main_title_settings
    )

    # Add a horizontal line at y=0 for reference
    fig.add_hline(y=0, line_color="grey", line_width=1)

    # Show and save the figure
    if logdir:
        name = "important_cancer_features_z_scores_boxplot_V2"
        fig.write_image(logdir / f"{name}.png")
        fig.write_image(logdir / f"{name}.svg")
        fig.write_html(logdir / f"{name}.html")

    fig.show()

Graph.

Code
plot_cnv_zscores_alt(cnv_df)

Fig. 2J: Distribution of the z-score enrichment for copy number (CN) signatures (dots) in genomic regions identified as important by the Cancer status classifier, compared to random control regions. CN signatures are grouped as being mostly associated with focal changes (CN9-12, CN17-21) or chromosomal ones (CN1-8, CN13-16). Boxplot elements are as in Fig. 2A.

Supplementary Figure 5

A - MLP performance on various classification tasks (100kb resolution)

See Figure 2A,B

B - chrY signal per sex: Average EpiRR z-score per assay

Define function zscore_per_assay to compute and graph the metric for each assay instead of globally.

Code
def zscore_per_assay(
    zscore_df: pd.DataFrame, logdir: Path | None = None, name: str | None = None
) -> None:
    """
    Plot the z-score distributions per assay.

    Does not include pval and raw tracks.

    Args:
        zscore_df: The dataframe with z-score data.
    """
    zscore_df = zscore_df.copy(deep=True)

    # Remove pval/raw tracks + rna unstranded
    zscore_df = zscore_df[~zscore_df["track_type"].isin(["pval", "raw", "Unique_raw"])]

    # Merge rna protocols
    zscore_df.replace({ASSAY: ASSAY_MERGE_DICT}, inplace=True)

    # Remove NAs
    metric_label = "chrY_zscore_vs_assay_w_track_exclusion"
    zscore_df = zscore_df[zscore_df[metric_label] != "NA"]

    assay_sizes = zscore_df[ASSAY].value_counts()
    assays = sorted(assay_sizes.index)

    x_title = "Sex z-score distributions per assay"
    fig = make_subplots(
        rows=1,
        cols=len(assays),
        shared_yaxes=True,
        x_title=x_title,
        y_title="z-score",
        horizontal_spacing=0.02,
        subplot_titles=[
            f"{assay_label} ({assay_sizes[assay_label]})" for assay_label in assays
        ],
    )

    for i, assay_label in enumerate(assays):  # assays already sorted above
        sub_df = zscore_df[zscore_df[ASSAY] == assay_label]

        hovertext = [
            f"{epirr}: z-score={z_score:.3f}, pred={pred:.3f}"
            for epirr, pred, z_score in zip(
                sub_df["EpiRR"],
                sub_df["Max pred"],
                sub_df[metric_label],
            )
        ]
        hovertext = np.array(hovertext)

        sub_df.reset_index(drop=True, inplace=True)
        y_values = sub_df[metric_label].values

        female_idx = np.argwhere((sub_df[SEX] == "female").values).flatten()
        male_idx = np.argwhere((sub_df[SEX] == "male").values).flatten()

        fig.add_trace(
            go.Box(
                name=assay_label,
                y=y_values[female_idx],
                boxmean=True,
                boxpoints="all",
                hovertemplate="%{text}",
                text=hovertext[female_idx],
                marker=dict(
                    size=2,
                    color=sex_colors["female"],
                    line=dict(width=0.5, color="black"),
                ),
                fillcolor=sex_colors["female"],
                line=dict(width=1, color="black"),
                showlegend=False,
                legendgroup="Female",
            ),
            row=1,
            col=i + 1,
        )

        fig.add_trace(
            go.Box(
                name=assay_label,
                y=y_values[male_idx],
                boxmean=True,
                boxpoints="all",
                hovertemplate="%{text}",
                text=hovertext[male_idx],
                marker=dict(
                    size=2, color=sex_colors["male"], line=dict(width=0.5, color="black")
                ),
                fillcolor=sex_colors["male"],
                line=dict(width=1, color="black"),
                showlegend=False,
                legendgroup="Male",
            ),
            row=1,
            col=i + 1,
        )

    # Add a dummy scatter plot for legend
    fig.add_trace(
        go.Scatter(
            x=[None],
            y=[None],
            mode="markers",
            name="Female",
            marker=dict(color=sex_colors["female"], size=20),
            showlegend=True,
            legendgroup="Female",
        )
    )
    fig.add_trace(
        go.Scatter(
            x=[None],
            y=[None],
            mode="markers",
            name="Male",
            marker=dict(color=sex_colors["male"], size=20),
            showlegend=True,
            legendgroup="Male",
        )
    )

    fig.update_xaxes(showticklabels=False)
    fig.update_yaxes(range=[-1.5, 3], showticklabels=True)

    fig.update_layout(
        width=1500,
        height=750,
    )

    # Save figure
    if logdir:
        if name is None:
            name = "zscore_distributions_per_assay"
        fig.write_image(logdir / f"{name}.svg")
        fig.write_image(logdir / f"{name}.png")
        fig.write_html(logdir / f"{name}.html")

    fig.show()

Graph zscore per assay.

Code
zscore_per_assay(cross_val_analysis)

Supp. Fig. 5B: Distribution of average z-score signal of epigenomes (dots) over chrY per sex (female in red, male in blue) for each assay individually (showing only the fold change track type for the ChIP datasets; the two types of WGBS and the two types of RNA-seq were each merged). Dashed lines represent means, solid lines the medians, boxes the quartiles, and whiskers the farthest points within 1.5× the interquartile range.
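The box-plot convention described in the caption (quartile boxes, median and mean lines, whiskers at the farthest points within 1.5× the interquartile range) can be sketched on synthetic data; the values and variable names below are illustrative only.

```python
import numpy as np

# Synthetic stand-in for a per-assay z-score distribution
rng = np.random.default_rng(0)
zscores = rng.normal(loc=0.0, scale=1.0, size=200)

# Quartiles and median define the box and the solid line
q1, med, q3 = np.percentile(zscores, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach the most extreme data points still within 1.5x IQR fences
lower_whisker = zscores[zscores >= q1 - 1.5 * iqr].min()
upper_whisker = zscores[zscores <= q3 + 1.5 * iqr].max()

print(q1, med, q3, lower_whisker, upper_whisker)
```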

To see RNA-Seq and WGBS, scroll to the right using the horizontal scrollbar at the bottom.

C - Female/Male chrY signal z-score cluster separation

Define function merged_assays_separation_distance that computes and graphs the separation distance between male/female z-score clusters.

Code
def merged_assays_separation_distance(
    zscore_df: pd.DataFrame, logdir: Path | None = None, name: str | None = None
) -> None:
    """Complement to figure 2E, showing separation distance (mean, median)
    between male/female zscore clusters, for ChIP-seq (core7). Grouped by EpiRR.

    Args:
        zscore_df (pd.DataFrame): The dataframe with z-score data.
        logdir (Path): The directory path to save the output plots.
        name (str): The base name for the output plot files.
    """
    metric_label = "chrY_zscore_vs_assay_track"

    # Preprocessing
    zscore_df = zscore_df.copy(deep=True)
    zscore_df.replace({ASSAY: ASSAY_MERGE_DICT}, inplace=True)

    zscore_df = zscore_df[zscore_df[ASSAY].isin(CORE7_ASSAYS)]  # type: ignore

    # Remove pval/raw tracks
    zscore_df = zscore_df[~zscore_df["track_type"].isin(["pval", "raw"])]

    # Average chrY z-score values
    mean_chrY_values_df = zscore_df.groupby(["EpiRR", SEX]).agg(
        {metric_label: "mean", "Max pred": "mean"}
    )
    mean_chrY_values_df.reset_index(inplace=True)
    if not mean_chrY_values_df["EpiRR"].is_unique:
        raise ValueError("EpiRR is not unique.")

    mean_chrY_values_df.reset_index(drop=True, inplace=True)

    distances = {"mean": [], "median": []}
    min_preds = list(np.arange(0, 1.0, 0.01)) + [0.999]
    sample_count = []
    for min_pred in min_preds:
        subset_chrY_values_df = mean_chrY_values_df[
            mean_chrY_values_df["Max pred"] > min_pred
        ]
        sample_count.append(subset_chrY_values_df.shape[0])

        # Compute separation distances
        chrY_vals_female = subset_chrY_values_df[subset_chrY_values_df[SEX] == "female"][
            metric_label
        ]
        chrY_vals_male = subset_chrY_values_df[subset_chrY_values_df[SEX] == "male"][
            metric_label
        ]

        if not chrY_vals_female.empty and not chrY_vals_male.empty:
            mean_distance = np.abs(chrY_vals_female.mean() - chrY_vals_male.mean())
            median_distance = np.abs(chrY_vals_female.median() - chrY_vals_male.median())

            distances["mean"].append(mean_distance)
            distances["median"].append(median_distance)
        else:
            distances["mean"].append(np.nan)
            distances["median"].append(np.nan)

    # Plotting the results
    fig = go.Figure()

    # Add traces for mean and median distances
    fig.add_trace(
        go.Scatter(
            x=min_preds,
            y=distances["mean"],
            mode="lines+markers",
            name="Mean Distance (left)",
            line=dict(color="blue"),
        )
    )
    fig.add_trace(
        go.Scatter(
            x=min_preds,
            y=distances["median"],
            mode="lines+markers",
            name="Median Distance (left)",
            line=dict(color="green"),
        )
    )

    # Add trace for number of files
    fig.add_trace(
        go.Scatter(
            x=min_preds,
            y=np.array(sample_count) / max(sample_count),
            mode="lines+markers",
            name="Proportion of samples (right)",
            line=dict(color="red"),
            yaxis="y2",
        )
    )

    fig.update_xaxes(range=[0.499, 1.0])

    # Update layout for secondary y-axis
    fig.update_layout(
        title="Separation Distance of chrY z-scores male/female clusters - ChIP-Seq",
        xaxis_title="Average Prediction Score minimum threshold",
        yaxis_title="Z-score Distance",
        yaxis2=dict(title="Proportion of samples", overlaying="y", side="right"),
        yaxis2_range=[0, 1.001],
        legend=dict(
            x=1.08,
        ),
    )

    # Save figure
    if logdir:
        if name is None:
            name = "zscore_cluster_separation_distance"
        fig.write_image(logdir / f"{name}.svg")
        fig.write_image(logdir / f"{name}.png")
        fig.write_html(logdir / f"{name}.html")

    fig.show()

Graph.

Code
merged_assays_separation_distance(cross_val_analysis)

Supp. Fig. 5C: Effect of a prediction score threshold on the aggregated mean (blue) and median (green) sex z-score male/female cluster distances, as well as the corresponding file subset size (red) of ChIP-related assays from panel B.
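The cluster-separation metric above is simply the absolute difference between the female and male mean (or median) chrY z-scores, recomputed after filtering on a minimum prediction score. A toy illustration on synthetic values (column names mirror the function above, except the literal "sex" column, which stands in for the notebook's SEX constant):

```python
import pandas as pd

# Synthetic per-epigenome values: two low-confidence samples sit near zero
df = pd.DataFrame(
    {
        "sex": ["female"] * 3 + ["male"] * 3,
        "chrY_zscore_vs_assay_track": [-0.9, -0.2, -0.8, 1.1, 0.2, 1.0],
        "Max pred": [0.99, 0.6, 0.95, 0.98, 0.55, 0.97],
    }
)

def separation(df: pd.DataFrame, min_pred: float) -> float:
    """Absolute distance between female and male mean chrY z-scores."""
    subset = df[df["Max pred"] > min_pred]
    female = subset.loc[subset["sex"] == "female", "chrY_zscore_vs_assay_track"]
    male = subset.loc[subset["sex"] == "male", "chrY_zscore_vs_assay_track"]
    return abs(female.mean() - male.mean())

# Raising the threshold drops the low-confidence near-zero samples,
# so the remaining clusters are farther apart
print(separation(df, 0.0), separation(df, 0.9))
```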

D - GP-Age, all samples

Considering all samples that had a conclusive EpiClass prediction (or an existing label and no conclusion).

Code
display_tissue_age_df(gp_graph_df)

graph_gp_age(
    df_gp_age=gp_graph_df,
    name="GP_age_predictions_all_samples_MLP_predicted_labels",
    title="All samples (has a life stage pred + gp-age)",
)
Tissue group summary
Blood: 239
Other: 326
Total: 565
Detail:
tissue_group  graph_age_categories
blood         adult                   201
              pediatric                26
              perinatal                12
other         adult                   272
              pediatric                25
              perinatal                29
dtype: int64

Supp. Fig. 5D: Distribution of the age prediction from GP-age for the 565 epigenomes (dots) with conclusive consensus predictions (epigenomes related to blood biospecimen are in red, others in grey). The perinatal category encompasses the original embryonic, fetal and newborn categories of metadata as they individually contain too few samples. Dashed lines represent means, solid lines the medians, boxes the quartiles, and whiskers the farthest points within 1.5× the interquartile range.
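The merge of the sparse embryonic, fetal and newborn classes into a single perinatal category can be sketched with a pandas replacement map; the column contents and label names below are illustrative, not the notebook's actual preprocessing code.

```python
import pandas as pd

# Hypothetical life-stage labels before merging
life_stage = pd.Series(
    ["adult", "fetal", "newborn", "embryonic", "pediatric", "adult"]
)

# Collapse the three sparse classes into "perinatal"
merge_map = {"embryonic": "perinatal", "fetal": "perinatal", "newborn": "perinatal"}
graph_age_categories = life_stage.replace(merge_map)

print(graph_age_categories.tolist())
```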

E - Sex genes chromatin states

Images extracted from Epilogos viewer, using specified coordinates (XIST and FIRRE positions), and:

  • View mode: Paired
  • Dataset: IHEC
  • Pairwise: Male VS Female 100 samples
  • Saliency Metric: S1

Supp. Fig. 5E: Epilogos pairwise comparisons of male (top) vs female (bottom) showing portions of important regions for the Sex classifier, including the XIST (left) and FIRRE (right) genes.

F - Sex genes genome browser representation

Supp. Fig. 5F: Genome browser representation of the important regions shown in Figure 2I.

Supplementary Figure 6 - Prediction score threshold impact

To be completed.

Supplementary Figure 7 - Biospecimen classifier - ChromScore for high-SHAP regions

See src/python/epiclass/utils/notebooks/analyze_hdf5_vals.ipynb, particularly section “ChromScore hdf5 values”.